feat: implement performance-based peer selection for snap sync (#9722)#9781
feat: implement performance-based peer selection for snap sync (#9722)#9781bcoste wants to merge 3 commits intohyperledger:mainfrom
Conversation
…ledger#9722) - Add peer performance metrics (latency, throughput) to EthPeer and EthPeerImmutableAttributes - Capture latency and throughput in AbstractPeerRequestTask - Add performance-based comparators (BY_LOW_LATENCY, BY_TRANSFER_SPEED) to EthPeers - Update AbstractRetryingSwitchingPeerTask to support custom comparators - Integrate performance-based selection into Snap Sync tasks and RequestDataStep - Add comprehensive unit tests for peer selection and metrics capturing Signed-off-by: Bruno Coste <bcoste@gmail.com>
93287d1 to
fed6ff2
Compare
|
@jframe Thanks for the guidance! I understand the importance of mainnet testing for validating this approach. However, due to disk space constraints on my local machine, I'm unable to run full mainnet comparison tests at this stage. My PR provides:
I believe this is a solid foundation for the feature, and the actual mainnet performance comparison could be:
Would this approach work for merging the initial implementation? I'm happy to adjust the implementation based on any feedback during review. |
Thanks @bcoste that approach will work. Appreciate that it's difficult to test on mainnet. I will finish my review of the PR and then when it's ready I help out testing syncing on mainnet with this PR. |
|
I just enabled the build on this PR and it's failing on some errors. There is a formatting issue. We run spotless to ensure code consistency. If you run Also there is a compile error on More details in the failing acceptance test https://github.com/hyperledger/besu/actions/runs/22042867978/job/63720174569?pr=9781. |
Signed-off-by: Bruno Coste <bcoste@gmail.com>
|
@jframe I've pushed the fixes. The PR should be ready for the CI checks to run again. Thanks! |
There are some failing unit tests. Do the unit tests pass locally with |
Add null checks for peers in AbstractPeerRequestTask before recording performance metrics, preventing NullPointerExceptions in unit tests. Signed-off-by: Bruno COSTE <bcoste@gmail.com>
@jframe I've pushed the fix for the NullPointerException in Regarding the two failures in
These tests pass successfully when run individually on my machine, but fail during the full Could you trigger a fresh CI run on GitHub? That will be the most reliable way to validate the build in a standard environment. Thanks! |
|
Tests are passing again after triggering CI. Running mainnnet syncs to test the performance of your changes. I've deployed 2 mainnet nodes with Bonsai and Snap with your change and two without as a control. |
|
Unfortunately I'm not seeing any significant difference in the world state. The world state download was slightly faster but within noise levels.
|
| private final AtomicInteger lastProtocolVersion = new AtomicInteger(0); | ||
|
|
||
| private volatile long lastRequestTimestamp = 0; | ||
| private static final double ALPHA = 0.1; |
There was a problem hiding this comment.
I think this needs to be configurable as an experimental option as don't know what alpha value makes sense for mainnet.
|
|
||
| private volatile long lastRequestTimestamp = 0; | ||
| private static final double ALPHA = 0.1; | ||
| private volatile double averageLatencyMs = -1.0; |
There was a problem hiding this comment.
Making the initial value the worst case may mean that some peers never get selected and get sampled.
|
Intuitively the approach makes sense. But in the testing so far I'm not seeing any measurable impact. I think some more detailing logging and or metrics are needed to understand the peer selection behaviour. Also configuring the alpha for EMA so different values could be tested. It also might be the just peer selection based on these values is not enough. The approach might need to be broadened to incorporate the peer scoring as well. There will be have to be a lot of testing and with different approaches until we know the correct approach to take. |
PR description
Implement performance-based peer selection for Besu snap sync to improve world state sync performance by routing requests to peers with optimal latency and throughput characteristics.
Fixed Issue(s)
Fixes #9722
Changes
Testing
./gradlew spotlessApply./gradlew build./gradlew acceptanceTestThanks for sending a pull request! Have you done the following?
doc-change-requiredlabel to this PR if updates are required.